Central driver implementation#13171
Conversation
There was a problem hiding this comment.
Pull request overview
This PR is a backup branch for the “central driver POC” work, transitioning KFP v2 driver execution from per-component driver pods to an Argo Workflow executor plugin + sidecar model, plus associated manifests, RBAC, CI, tests, and UI support for persisted driver logs.
Changes:
- Update Argo-compiled workflow “golden” YAMLs to use
plugin: driver-plugintemplates and add hardened security contexts. - Add/patch Kubernetes manifests (kustomize overlays, RBAC, workflow-controller settings) to enable Argo executor plugins and configure the driver plugin sidecar via ConfigMaps.
- Extend backend + frontend to persist driver logs (via MLMD custom properties) and surface them in the UI; adjust tests and CI scripts accordingly.
Reviewed changes
Copilot reviewed 124 out of 209 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| test_data/compiled-workflows/mixed_parameters.yaml | Switch driver execution from container driver pod to driver-plugin and add securityContext hardening in golden YAML. |
| test_data/compiled-workflows/metrics_visualization_v2.yaml | Same as above for metrics visualization golden YAML. |
| test_data/compiled-workflows/lightweight_python_functions_with_outputs.yaml | Same as above for lightweight python functions (outputs) golden YAML. |
| test_data/compiled-workflows/lightweight_python_functions_pipeline.yaml | Same as above for lightweight python functions pipeline golden YAML. |
| test_data/compiled-workflows/iris_pipeline_compiled.yaml | Same as above for iris pipeline golden YAML. |
| test_data/compiled-workflows/if_else_with_oneof_artifacts.yaml | Same as above for if/else (oneof artifacts) golden YAML. |
| test_data/compiled-workflows/if_elif_else_with_oneof_parameters.yaml | Same as above for if/elif/else (oneof parameters) golden YAML. |
| test_data/compiled-workflows/identity.yaml | Same as above for identity golden YAML. |
| test_data/compiled-workflows/hello_world.yaml | Same as above for hello world golden YAML. |
| test_data/compiled-workflows/flip_coin.yaml | Same as above for flip coin golden YAML. |
| test_data/compiled-workflows/embedded_artifact.yaml | Same as above for embedded artifact golden YAML. |
| test_data/compiled-workflows/dict_input.yaml | Same as above for dict input golden YAML. |
| test_data/compiled-workflows/container_io.yaml | Same as above for container IO golden YAML. |
| test_data/compiled-workflows/concat_message.yaml | Same as above for concat message golden YAML. |
| test_data/compiled-workflows/component_with_metadata_fields.yaml | Same as above for component metadata fields golden YAML. |
| test_data/compiled-workflows/artifact_crust.yaml | Same as above for artifact crust golden YAML. |
| test_data/compiled-workflows/artifact_cache.yaml | Same as above for artifact cache golden YAML. |
| test_data/compiled-workflows/arguments_parameters.yaml | Same as above for arguments parameters golden YAML. |
| test_data/compiled-workflows/add_numbers.yaml | Same as above for add numbers golden YAML. |
| manifests/kustomize/third-party/argo/installs/namespace/workflow-controller-argo-role-patch.json | Patch Workflow Controller role to add WorkflowTaskSet status permissions for agent/plugin flow. |
| manifests/kustomize/third-party/argo/base/workflow-controller-deployment-patch.yaml | Enable Argo executor plugins via env vars. |
| manifests/kustomize/third-party/argo/base/workflow-controller-argo-taskset-clusterrole-.yaml | Add cluster-scoped RBAC for WorkflowTaskSet access needed by Argo agent/plugin. |
| manifests/kustomize/third-party/argo/base/kustomization.yaml | Include the new WorkflowTaskSet RBAC resource. |
| manifests/kustomize/third-party/application/application-controller-deployment.yaml | Increase application-controller memory request. |
| manifests/kustomize/env/dev/kustomization.yaml | Pin kfp-driver image tag and add dev overlay patch for driver plugin ConfigMap. |
| manifests/kustomize/env/dev/driver-plugin-cm-path.yaml | Dev overlay driver plugin sidecar ConfigMap (image + env + resources + securityContext). |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-ui-deployment.yaml | Update TLS secret reference to Argo agent CA secret name. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-scheduledworkflow-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-persistenceagent-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-driver-plugin-cm.yaml | TLS overlay patch for driver plugin sidecar ConfigMap (+ cert mount). |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/ml-pipeline-apiserver-deployment.yaml | Update CABUNDLE secret name + volume secret for TLS. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-writer-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-grpc-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/patches/metadata-envoy-deployment.yaml | Same TLS secret rename. |
| manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml | Apply driver plugin ConfigMap patch in TLS overlay. |
| manifests/kustomize/env/cert-manager/base-tls-certs/kfp-api-cert.yaml | Rename cert secret to match Argo agent’s required secret name. |
| manifests/kustomize/base/pipeline/pipeline-runner-sa.yaml | Add explicit SA token Secret for pipeline-runner. |
| manifests/kustomize/base/pipeline/pipeline-runner-role.yaml | Expand RBAC for WorkflowTaskSet resources and add executor-plugin service account + roles/bindings. |
| manifests/kustomize/base/pipeline/ml-pipeline-driver-plugin-cm.yaml | Base driver plugin sidecar ConfigMap (ExecutorPlugin type). |
| manifests/kustomize/base/pipeline/kustomization.yaml | Add driver plugin ConfigMap to base resources. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py | Add Kubernetes client usage to provision executor-plugin SA and RBAC per namespace. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/pipelines-profile-controller-admin.yaml | Add admin SA + ClusterRole/Binding for profile controller to manage SA/RBAC. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/kustomization.yaml | Include the new admin RBAC resource. |
| manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/deployment.yaml | Run profile controller with admin SA; install Python deps at runtime. |
| frontend/src/hooks/queryKeys.ts | Add React Query key for driver/system logs. |
| frontend/src/components/tabs/RuntimeNodeDetailsV2.tsx | Add “System Logs” tab and fetch driver logs artifact via API. |
| backend/test/testutil/kubernetes_utils.go | Formatting cleanup. |
| backend/test/end2end/utils/e2e_utils.go | Capture workflow-controller logs for failed runs and handle agent pod logs. |
| backend/test/end2end/e2e_suite_test.go | Formatting cleanup. |
| backend/test/compiler/utils/workflow_utils.go | Ensure cache settings updates also apply to the driver plugin args. |
| backend/src/v2/metadata/client.go | Persist driver_logs_uri (+ store session info) in execution custom properties. |
| backend/src/v2/driver/root_dag.go | Use context-aware logger and return pipeline from RootDAG. |
| backend/src/v2/driver/k8s_test.go | Ensure tests provide a logger via context. |
| backend/src/v2/driver/k8s.go | Replace glog with context logger; require logger in context for several paths. |
| backend/src/v2/driver/driver_test.go | Ensure tests provide a logger via context. |
| backend/src/v2/driver/driver.go | Change output URI prefix naming used for task-root generation. |
| backend/src/v2/driver/dag.go | Avoid re-fetching pipeline in DAG driver; use context logger. |
| backend/src/v2/driver/container.go | Pass output prefix; write DriverLogURI custom property; use context logger. |
| backend/src/v2/compiler/argocompiler/plugin.go | Add helper to build Argo executor plugin template JSON. |
| backend/src/v2/compiler/argocompiler/dag.go | Switch DAG driver template to Plugin with args map; adjust call sites for error handling. |
| backend/src/v2/compiler/argocompiler/container_test.go | Skip env var test that no longer applies under plugin-based execution. |
| backend/src/v2/compiler/argocompiler/container.go | Switch container driver template from Container to Plugin, including JSONPath outputs. |
| backend/src/v2/compiler/argocompiler/argo.go | Disable Istio sidecar injection on workflow pods; remove driver image/command from compiler state. |
| backend/src/v2/cmd/driver/execution_paths.go | Remove old v2 driver execution paths file (moved). |
| backend/src/driver/main_test.go | Update test expectations for renamed symbols and parsing function name. |
| backend/src/driver/main.go | New driver plugin HTTP server entrypoint and shared helpers. |
| backend/src/driver/execution_paths.go | Re-add ExecutionPaths struct under new driver plugin package. |
| backend/src/driver/api/response.go | Add HTTP response DTOs for driver server/plugin execution. |
| backend/src/driver/api/request.go | Add HTTP request DTOs for driver server/plugin execution. |
| backend/src/common/util/context_logger.go | Introduce context logger (logrus) writing to stdout + file. |
| backend/Dockerfile.driver | Build the new driver plugin binary from backend/src/driver. |
| .github/workflows/legacy-v2-api-integration-tests.yml | Update CA secret name used by CI. |
| .github/workflows/e2e-test.yml | Update CA secret name used by CI. |
| .github/workflows/api-server-tests.yml | Update CA secret name used by CI. |
| .github/resources/scripts/kfp-readiness/wait_for_pods.py | Exclude *-agent pods from readiness filter. |
| .github/resources/scripts/collect-logs.sh | Add agent container log collection and describe workflows. |
| .github/resources/manifests/standalone/tls-enabled/kustomization.yaml | Patch driver plugin ConfigMap in TLS-enabled GH manifests. |
| .github/resources/manifests/standalone/tls-enabled/driver-plugin-cm-path.yaml | Provide driver plugin sidecar ConfigMap for TLS-enabled GH manifests. |
| .github/resources/manifests/standalone/proxy/kustomization.yaml | Patch driver plugin ConfigMap in proxy GH manifests. |
| .github/resources/manifests/standalone/default/kustomization.yaml | Patch driver plugin ConfigMap in default GH manifests. |
| .github/resources/manifests/standalone/cache-disabled/kustomization.yaml | Patch driver plugin ConfigMap in cache-disabled GH manifests. |
| .github/resources/manifests/standalone/cache-disabled-proxy/kustomization.yaml | Patch driver plugin ConfigMap in cache-disabled-proxy GH manifests. |
| .github/resources/manifests/multiuser/default/kustomization.yaml | Patch driver plugin ConfigMap in multiuser GH manifests. |
| .github/resources/manifests/multiuser/cache-disabled/kustomization.yaml | Patch driver plugin ConfigMap in multiuser cache-disabled GH manifests. |
| .github/resources/manifests/multiuser/artifact-proxy/kustomization.yaml | Patch driver plugin ConfigMap in artifact-proxy GH manifests. |
| .github/resources/manifests/kubernetes-native/default/kustomization.yaml | Patch driver plugin ConfigMap in kubernetes-native GH manifests. |
| .github/resources/manifests/base/driver-plugin-cm-path.yaml | Base GH manifest overlay for driver plugin sidecar ConfigMap. |
| command: | ||
| - sh | ||
| - -c | ||
| - | | ||
| python3 -m venv /tmp/venv && \ | ||
| . /tmp/venv/bin/activate && \ | ||
| pip install --no-cache-dir kubernetes boto3 && \ | ||
| python /hooks/sync.py |
There was a problem hiding this comment.
Installing dependencies at container startup is operationally risky:
- Requires outbound network access to PyPI (often blocked in clusters).
- Adds cold-start latency and makes behavior non-reproducible.
- Can fail if
python3,venv, orpipare not present in the base image.
Fix (required): bake kubernetes + boto3 into the image used for the profile controller (or switch to an image that already contains them), and run python /hooks/sync.py directly.
There was a problem hiding this comment.
The metacontroller and pipelines profile controller is different. PPC only provides a json to metacontroller and metacontroller handles the rest.
There was a problem hiding this comment.
There was a problem hiding this comment.
| return execution, err | ||
| } | ||
| ecfg.TaskName = opts.TaskName | ||
| ecfg.Namespace = opts.Namespace |
There was a problem hiding this comment.
ExecutionConfig (in backend/src/v2/metadata/client.go) does not show a Namespace field in the diff, so ecfg.Namespace = opts.Namespace is very likely a compile-time error.
Fix (required): either:
- add a
Namespacefield tometadata.ExecutionConfigand plumb it through (if needed), or - remove this assignment if namespace is already available via
opts.Namespace/ pipeline context.
| ecfg.Namespace = opts.Namespace |
1ad5ac7 to
de96fee
Compare
|
/retest |
2 similar comments
|
/retest |
|
/retest |
de96fee to
0a4db0b
Compare
|
/retest |
da711da to
93b2f1a
Compare
There was a problem hiding this comment.
Hello, I am really concerned about the sync.py changes. Why do you want to create new serviceaccounts? Pipelines usually run as default-editor in Kubeflow. Usually such roles should be aggregated to kubeflow-edit. Adding attidional service accounts i really want to avoid. That will probably conflict with the average multi-user Kubeflow and create technical debt.
There was a problem hiding this comment.
i saw this patch in multiple places. Can we consolidate ?
There was a problem hiding this comment.
I added a few comments. The PipelineProfileController+Metacontroller architecture is different than what you expect your code does not match it.
See https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/decorator-controller.yaml and https://github.com/kubeflow/pipelines/blob/master/manifests/kustomize/third-party/metacontroller/base/cluster-role.yaml. The pipelines profile controller just provides the input for metacontroller.
You also patch the permissions for "workflowtasksets/status" in many files which adds a lot of code duplication.
The PSS restricted stuff looks good.
ca40a3c to
23456c9
Compare
|
/retest |
1 similar comment
|
/retest |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
e7e61b8 to
e85dac2
Compare
|
please rebase to master |
e85dac2 to
6ca0446
Compare
…and eliminate per-component driver pods Previously, each driver component was executed as a separate Kubernetes pod, leading to significant overhead in pod scheduling, resource consumption, and control plane load, especially for pipelines with many small components. This change replaces the pod-per-driver model with a centralized driver invoked via an Argo Workflow executor plugin using a sidecar-based execution model. A new driver service is introduced and exposed via an HTTP API, allowing Argo to delegate driver execution without spawning dedicated pods. Key changes: - Remove pod-per-driver execution model - Introduce centralized driver implemented as an Argo executor plugin - Add driver-plugin sidecar configuration via Kubernetes ConfigMaps - Add new driver implementation (backend/src/driver) with HTTP API - Add required Kubernetes manifests, including RBAC roles and bindings, for the plugin - Integrate plugin configuration into all kustomize deployment variants - Update CI, TLS handling, and debugging scripts to support plugin-based execution - Improve logging with context-aware driver logger - Persist driver logs to S3 and expose them in the UI Benefits: - Eliminates driver pod churn and reduces scheduling overhead - Reduces pressure on Kubernetes control plane (API server, scheduler, etcd) - Improves execution latency - Enables more efficient and scalable pipeline execution - Provides better observability: driver logs are now available in the UI and persistently stored in S3 This change introduces a new execution model for driver components and lays the groundwork for further optimizations in pipeline runtime. Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
- fix rbac granting - drop redundant patch Signed-off-by: arpechenin <arpechenin@avito.ru>
- rewrite tests - move updated logic from the kubeflow@f8ee99c#diff-937c57a877a615c510d37e38309f0b159c85595d2dfa908d9e9ef05c227064da Signed-off-by: arpechenin <arpechenin@avito.ru>
Replaced the namespace-scoped Role with a ClusterRole to avoid creating the same role in every namespace. Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
Signed-off-by: arpechenin <arpechenin@avito.ru>
- added explanation for list pods RBAC - enhanced post-e2e test debugging - tweaked driver resources - added comments clarifying why extra RBAC is required - fixed images to match master - removed secret duplication and hardcoded namespace - update compiler golden files Signed-off-by: arpechenin <arpechenin@avito.ru>
0a89003 to
499c666
Compare
Driver templates now run as Argo executor plugins instead of workflow container templates, so run-specific env injection no longer applies to them. Store driver runtime values under driver-plugin args and initialize task plugins from those args, while keeping container env injection for launcher templates unchanged. Signed-off-by: arpechenin <arpechenin@avito.ru>
499c666 to
5dda2a6
Compare
|
/retest |
|
Well, all tests are finally passing On Monday I'll post an update with the current status and a summary of what changed during the rebase. |
|
Hi @mprahl, could you please take a look at the MLflow runtime config changes for the driver plugin? Since the driver is now an Argo executor plugin, its env vars and pod annotations are only available for global plugin configuration. Because of that, I no longer pass Previously the flow was:
Now I detect the driver plugin via The driver plugin then parses I also added MLflow RBAC ( |
backup: save central-driver-poc branch before merge
This branch is a backup of central-driver-poc before merging master.
The merge caused too many conflicts for a clean rebase, so the diffs
were extracted into this backup branch to preserve all work safely.
All discussions and comments remain in the original branch.
POC for #12023
Resolves: #12642
feat(backend): introduce centralized driver via Argo executor plugin and eliminate per-component driver pods
Previously, each driver component was executed as a separate Kubernetes pod,
leading to significant overhead in pod scheduling, resource consumption,
and control plane load, especially for pipelines with many small components.
This change replaces the pod-per-driver model with a centralized driver invoked via an Argo Workflow executor plugin using a sidecar-based execution model.
A new driver service is introduced and exposed via an HTTP API, allowing Argo to delegate driver execution without spawning dedicated pods.
Key changes:
Benefits:
This change introduces a new execution model for driver components and lays the groundwork for further optimizations in pipeline runtime.
Description of your changes:
Checklist: